Laura Vermeren, Laura Symul, Lara Wautier, Céline Bugli
Published
June 17, 2025
The purpose of this document is to create:
a participants table with critical participant-invariant information (arm, site, etc.), including the population flags, along with a participants_variable_dictionary
a visits_long table with the visit_code and study_day for each participant and each CRF. This table is used to identify residual discrepancies between CRFs in visit_code and study_day (and harmonize them if needed).
a visits table with one row per pid and visit_code and the following columns:
study_day: harmonized study_day across CRFs
visit_type: whether it is a clinic or home visit
visit_planned: whether the visit was planned or additional
visit_attended: whether the visit was attended or not. For home visits, this notes whether we have any CRF data for that visit
specimen_collection_swab (collected, not collected, unclear) [CRF35]
specimen_collection_softcup (collected, not collected, unclear) [CRF35]
specimen_collection_cytobrush (collected, not collected, unclear) [CRF35]
specimen_collection_home_swab (collected, not collected, unclear) [CRF47]
along with a visits_variable_dictionary
Load the crf data processed in the previous step 01a clinical CRF data cleaning.qmd
# For now, we do not know the arms yet, only whether participants were in the overlap arm or not.
participants <- participants |>
  mutate(
    arm = case_when(
      group4 == "Yes" ~ "LC-106-o",
      group4 == "No" ~ "Blinded",
      is.na(group4) ~ NA_character_
    ) |>
      factor(levels = c("Pl", "LC-106-7", "LC-106-3", "LC-106-o", "LC-115", "Blinded"))
  ) |>
  select(-group4)
Code
participants |>
  dplyr::count(site, randomized, arm) |>
  # the title must be passed as `caption`; an unnamed string is read as `rowname_col`
  gt(caption = "Number of randomized participants per site and per arm")
site  randomized  arm       n
CAP   FALSE       NA        393
CAP   TRUE        LC-106-o  15
CAP   TRUE        Blinded   57
MGH   FALSE       NA        43
MGH   TRUE        LC-106-o  1
MGH   TRUE        Blinded   23
Population flags (ITT, mITT and PP)
The study population flags (ITT, mITT, PP) are built using the following steps and criteria:
We define an eligibility flag from CRF7 (we take the last screening visit for participants who came twice);
We create two variables related to sample collection from CRF35 data (Specimen collection)
n_swabs_post_product, which counts the total number of swabs collected after product administration (i.e., from the first follow-up visit, which is visit 1200 for all participants, including LC-106-o participants);
n_swabs_post_product_prior_week7, which counts the total number of swabs collected at clinic visits 1200 to 1500 (incl.).
We identify replaced participants from CRF44 (Replacement) to exclude participants that have been replaced from the PP;
We determine the number of missed doses from CRF23 (Product Use) and/or our “consolidated” product exposure table to identify participants who took at least one dose of study product (mITT) or 80% of the doses (PP).
Finally, from these variables, we can define:
ITT:
all randomized participants (have a treatment code in CRF13)
mITT:
ITT &
met eligibility (CRF7) &
n_product_doses > 0 &
n_swabs_post_product > 0
PP:
ITT &
not replaced (CRF44) &
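did not meet replacement criteria (CRFXXX, XXX, and XXX) &
took more than 80% of the 7 doses (number of doses > 0.8 * 7) &
n_swabs_post_product_prior_week7 > 0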
# COLLECTED SWABS
# We add information from CRF35 about collected swabs at in-person visits
crf35_sub <- crf_clean$crf35 |>
  select(pid, dfseq, self_collected_swabs) |>
  mutate(pid = pid |> as.character()) |>
  group_by(pid) |>
  # we count how many swabs were collected after the end of intervention (i.e., from visit 1200, for all intervention arms)
  mutate(
    n_swabs_post_product = sum(self_collected_swabs[dfseq >= 1200], na.rm = TRUE),
    n_swabs_post_product_prior_week7 = sum(self_collected_swabs[(dfseq >= 1200) & (dfseq <= 1500)], na.rm = TRUE)
  ) |>
  ungroup() |>
  select(pid, n_swabs_post_product, n_swabs_post_product_prior_week7) |>
  distinct()

participants <- participants |>
  select(-any_of(c("n_swabs_post_product", "n_swabs_post_product_prior_week7"))) |>
  left_join(crf35_sub, by = "pid") |>
  mutate(
    n_swabs_post_product = n_swabs_post_product |> replace_na(0),
    n_swabs_post_product_prior_week7 = n_swabs_post_product_prior_week7 |> replace_na(0)
  )

rm(crf35_sub)

participants |>
  count(randomized, n_swabs_post_product > 0, n_swabs_post_product_prior_week7 > 0) |>
  gt()
randomized  n_swabs_post_product > 0  n_swabs_post_product_prior_week7 > 0  n
FALSE       FALSE                     FALSE                                 436
TRUE        FALSE                     FALSE                                 4
TRUE        TRUE                      FALSE                                 1
TRUE        TRUE                      TRUE                                  91
Code
participants |>
  filter(randomized, n_swabs_post_product > 0, n_swabs_post_product_prior_week7 == 0) |>
  left_join(crf_clean$crf35 |> mutate(pid = pid |> as.character()), by = join_by(pid)) |>
  gt::gt(caption = "Participants that had swabs collected in the extended follow-up period but not in the follow-up period.")
Participants that had swabs collected in the extended follow-up period but not in the follow-up period.
(Table shown transposed: one row per variable, one column per CRF35 record of this participant.)
variable                            record 1           record 2           record 3           record 4
pid                                 068100050          068100050          068100050          068100050
site                                MGH                MGH                MGH                MGH
location                            US                 US                 US                 US
randomized                          TRUE               TRUE               TRUE               TRUE
arm                                 Blinded            Blinded            Blinded            Blinded
meet_eligibility                    TRUE               TRUE               TRUE               TRUE
n_swabs_post_product                6                  6                  6                  6
n_swabs_post_product_prior_week7    0                  0                  0                  0
uid                                 068100050_0000     068100050_1000     068100050_1100     068100050_1700
visit_code                          0000               1000               1100               1700
dfseq                               0                  1000               1100               1700
vdate_fixed                         FALSE              FALSE              FALSE              FALSE
study_day                           -5                 1                  8                  51
Softcup time placed                 12:21              12:21              12:27              13:08
Softcup time removed                12:45              12:45              12:50              13:29
specimen1_specify                   psa                psa                psa                NA
Other specimen 2 specify            NA                 NA                 NA                 NA
Softcup                             Checked            Checked            Checked            Checked
Vaginal swab                        Checked            Checked            Checked            Checked
Sti testing                         Checked            Unchecked          Unchecked          Checked
Amsel criteria and nugent score     Checked            Checked            Checked            Checked
Number of self collected swabs      8                  6                  8                  6
Pap smear                           Checked            Unchecked          Unchecked          Unchecked
Cervicovaginal swab for hpv         Checked            Unchecked          Unchecked          Checked
Endocervical cytobrush              Unchecked          Checked            Checked            Checked
Rectal swab                         Checked            Checked            Checked            Checked
Urine                               Checked            Checked            Checked            Checked
Urinalysis                          Unchecked          Checked            Checked            Checked
Poc pregnancy                       Checked            Checked            Checked            Unchecked
Blood sample                        Checked            Checked            Checked            Checked
Blood for testing hiv               Checked            Unchecked          Unchecked          Checked
Blood for testing syphilis          Checked            Unchecked          Unchecked          Checked
Blood for testing hsv               Checked            Unchecked          Unchecked          Checked
Complete blood count                Checked            Unchecked          Unchecked          Unchecked
Blood for testing blood type        Checked            Unchecked          Unchecked          Unchecked
Blood for research tube             Unchecked          Checked            Checked            Checked
Other specimen 1                    Checked            Checked            Checked            Checked
Other specimen 2                    Unchecked          Unchecked          Unchecked          Unchecked
Were home swabs collected?          No                 No                 Yes                No
Number of swabs by participant      NA                 NA                 NA                 NA
Condition of collected home swabs   Blank              Blank              Blank              Blank
softcup_time_placed_t               0-01-01 12:21:00   0-01-01 12:21:00   0-01-01 12:27:00   0-01-01 13:08:00
softcup_time_removed_t              0-01-01 12:45:00   0-01-01 12:45:00   0-01-01 12:50:00   0-01-01 13:29:00
softcup_collection_duration         24                 24                 23                 21
Code
# REPLACEMENT
# we check if participants have been replaced
crf44_sub <- crf_clean$crf44 |>
  select(pid, reason_replacement) |>
  mutate(replaced = TRUE)

participants <- participants |>
  select(-any_of(c("replaced"))) |>
  left_join(
    crf44_sub |> select(pid, replaced) |> mutate(pid = pid |> as.character()),
    by = "pid"
  ) |>
  mutate(replaced = replaced |> replace_na(FALSE))

rm(crf44_sub)

# or if they met any of the replacement criteria
# from SAP:
# 3.6.3 Bleeding
# If a participant has 3 days of heavy bleeding (i.e. soaking a pad at least once in a day) during study product dosing, or more than 5 days in a row of heavy bleeding during follow-up, an additional participant will be added to that study arm in the remaining randomization matrix.
# 3.6.4 Antibiotics
# If a participant needs antibiotic treatment for any reason (recurrent BV, urinary tract infection, other) before the week 5 visit, an additional participant will be added to that study arm in the remaining randomization matrix.
# 3.6.5 HSIL pap
# If the Pap smear obtained at the screening visit is read as HSIL, an additional participant will be added to that study arm in the remaining randomization matrix.

# we use CRF 33 (daily diary) for bleeding
crf33_sub <- crf_clean$crf33 |>
  select(pid, visit_code, vaginal_bleeding) |>
  left_join(crf_clean$crf13 |> select(pid, group4), by = join_by(pid)) |>
  mutate(
    period = case_when(
      (visit_code < 1000) ~ "pre-mtz",
      (group4 == "No") & (visit_code %in% 1000:1099) ~ "mtz",
      (group4 == "Yes") & (visit_code %in% 1001:1200) ~ "product",
      (group4 == "No") & (visit_code %in% 1100:1200) ~ "product",
      (visit_code %in% 1201:1500) ~ "follow-up",
      TRUE ~ NA_character_
    )
  ) |>
  filter(period %in% c("product", "follow-up")) |>
  group_by(pid, period) |>
  summarize(
    n_heavy_bleeding = sum(vaginal_bleeding == "Yes - Heavy", na.rm = TRUE),
    n_max_consecutive_heavy_bleeding = max(
      rle(vaginal_bleeding == "Yes - Heavy")$lengths[rle(vaginal_bleeding == "Yes - Heavy")$values],
      na.rm = TRUE
    ),
    .groups = "drop"
  ) |>
  mutate(
    replacement_criteria_bleeding = case_when(
      (period == "product" & n_heavy_bleeding >= 3) ~ TRUE,
      (period == "follow-up" & n_max_consecutive_heavy_bleeding >= 5) ~ TRUE,
      TRUE ~ FALSE
    )
  ) |>
  group_by(pid) |>
  summarise(
    replacement_criteria_bleeding = any(replacement_criteria_bleeding, na.rm = TRUE),
    .groups = "drop"
  )
Warning: There were 175 warnings in `summarize()`.
The first warning was:
ℹ In argument: `n_max_consecutive_heavy_bleeding = max(...)`.
ℹ In group 1: `pid = "068100004"` `period = "follow-up"`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 174 remaining warnings.
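These warnings arise whenever a participant has no run of heavy bleeding in a period: the subset of run lengths is empty and max() returns -Inf. Since -Inf >= 5 is FALSE, the resulting flag is unaffected, but the warnings are noisy. A minimal sketch of a helper that returns 0 in that case is given below; max_run_length is a hypothetical name, and treating NA as "not heavy" is an assumption of this sketch (it breaks a run, as it does in the rle() call above).

```r
# Longest run of TRUE values in a logical vector.
# Assumption of this sketch: NA counts as "not heavy" and therefore breaks a run.
max_run_length <- function(x) {
  x <- !is.na(x) & x
  runs <- rle(x)
  true_runs <- runs$lengths[runs$values]
  # No TRUE run at all: return 0 instead of calling max() on an empty vector
  if (length(true_runs) == 0) 0L else max(true_runs)
}

# Toy diary values; "Yes - Light" is a made-up level used only for illustration
bleeding <- c("No", "Yes - Heavy", "Yes - Heavy", NA, "Yes - Heavy", "Yes - Light")
max_run_length(bleeding == "Yes - Heavy")       # longest heavy-bleeding streak
max_run_length(c("No", "No") == "Yes - Heavy")  # no heavy bleeding: 0, no warning
```

Inside summarize(), this could replace the max(rle(...)) expression, e.g. n_max_consecutive_heavy_bleeding = max_run_length(vaginal_bleeding == "Yes - Heavy").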
Code
# 3.6.4 Antibiotics
# If a participant needs antibiotic treatment for any reason (recurrent BV, urinary tract infection, other) before the week 5 visit, an additional participant will be added to that study arm in the remaining randomization matrix.
# we use CRF 20 (follow-up questionnaire) for antibiotics
crf20_sub <- crf_clean$crf20 |>
  select(pid, visit_code, used_during_past_week) |>
  filter(visit_code > 1000, visit_code <= 1500) |>
  group_by(pid) |>
  summarize(
    any_antibiotics = any(str_detect(used_during_past_week, "antibiotic"))
  )

# 3.6.5 HSIL pap
# If the Pap smear obtained at the screening visit is read as HSIL, an additional participant will be added to that study arm in the remaining randomization matrix
# we use CRF 15 (pap smear and blood) for HSIL pap
crf15_sub <- crf_clean$crf15 |>
  select(pid, visit_code, pap_smear_results) |>
  filter(visit_code <= 1000) |>
  group_by(pid) |>
  summarize(
    any_HSIL = any(str_detect(pap_smear_results, "HSIL"))
  )

participants <- participants |>
  select(-any_of(c("replacement_criteria_bleeding"))) |>
  left_join(crf33_sub, by = "pid") |>
  mutate(replacement_criteria_bleeding = replacement_criteria_bleeding |> replace_na(FALSE)) |>
  select(-any_of(c("any_antibiotics"))) |>
  left_join(crf20_sub, by = "pid") |>
  mutate(any_antibiotics = any_antibiotics |> replace_na(FALSE)) |>
  select(-any_of(c("any_HSIL"))) |>
  left_join(crf15_sub, by = "pid") |>
  mutate(any_HSIL = any_HSIL |> replace_na(FALSE)) |>
  mutate(meets_replacement_criteria = replacement_criteria_bleeding | any_antibiotics | any_HSIL)

participants |>
  count(randomized, replaced, meets_replacement_criteria) |>
  gt(caption = "Replacement criteria")
Replacement criteria
randomized  replaced  meets_replacement_criteria  n
FALSE       FALSE     FALSE                       433
FALSE       FALSE     TRUE                        3
TRUE        FALSE     FALSE                       91
TRUE        FALSE     TRUE                        3
TRUE        TRUE      FALSE                       2
Code
# PRODUCT USE
# we add information about product use
# two ways: 1 = CRF23 aggregation
crf23_sub <- crf_clean$crf23 |>
  select(pid, dfseq, n_missed_days) |>
  group_by(pid) |>
  summarize(
    n_missed_days = case_when(
      all(is.na(n_missed_days)) ~ NA_integer_,
      TRUE ~ sum(n_missed_days, na.rm = TRUE)
    )
  ) |>
  mutate(n_study_product_doses = 7 - n_missed_days)

# 2. exposures table
# we use exposures table to get the number of study product doses
exposures_studyproduct_summary <- exposures |>
  mutate(pid = pid |> as.character()) |>
  group_by(pid) |>
  summarise(
    n_study_product_doses = sum(as.numeric(study_product), na.rm = TRUE)
  ) |>
  mutate(n_missed_days = 7 - n_study_product_doses)

full_join(crf23_sub, exposures_studyproduct_summary, by = "pid") |>
  filter(
    is.na(n_study_product_doses.x) |
      is.na(n_study_product_doses.y) |
      (n_study_product_doses.x != n_study_product_doses.y)
  ) |>
  left_join(participants |> select(pid, randomized, arm), by = "pid") |>
  gt(caption = "Disagreement between CRF23 and exposures table")
# create the flag
participants <- participants |>
  mutate(
    ITT = randomized,
    mITT = ITT & meet_eligibility & (n_study_product_doses_crf23 > 0) & (n_swabs_post_product > 0),
    PP = ITT & !replaced & !meets_replacement_criteria & (n_study_product_doses_crf23 > 0.8 * 7) & (n_swabs_post_product_prior_week7 > 0),
    mITT_2 = ITT & meet_eligibility & (n_study_product_doses_exposures > 0) & (n_swabs_post_product > 0),
    PP_2 = ITT & !replaced & !meets_replacement_criteria & (n_study_product_doses_exposures > 0.8 * 7) & (n_swabs_post_product_prior_week7 > 0)
  )
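To make the flag logic concrete, here is a minimal sketch on invented toy data (the pids, the n_doses column and all values are hypothetical). It illustrates two points: with 7 planned doses, the threshold 0.8 * 7 = 5.6 means at least 6 doses are needed for PP, and a missing dose count propagates to an NA flag for randomized participants (TRUE & NA is NA, whereas FALSE & NA is FALSE), which is presumably why missing CRF23 values are imputed to 0 before the flags are built.

```r
# Toy data: every value below is invented for illustration only
toy <- data.frame(
  pid = c("A", "B", "C", "D"),
  randomized = c(TRUE, TRUE, TRUE, FALSE),
  meet_eligibility = c(TRUE, TRUE, TRUE, NA),
  n_doses = c(7, 5, NA, NA),
  n_swabs_post_product = c(3, 2, 1, 0),
  n_swabs_post_product_prior_week7 = c(3, 2, 1, 0),
  replaced = FALSE,
  meets_replacement_criteria = FALSE
)

n_planned_doses <- 7  # same constant as in the `0.8 * 7` threshold

toy$ITT <- toy$randomized
toy$mITT <- toy$ITT & toy$meet_eligibility &
  (toy$n_doses > 0) & (toy$n_swabs_post_product > 0)
toy$PP <- toy$ITT & !toy$replaced & !toy$meets_replacement_criteria &
  (toy$n_doses > 0.8 * n_planned_doses) &
  (toy$n_swabs_post_product_prior_week7 > 0)

# A: all flags TRUE; B: 5 doses is below the PP threshold;
# C: missing dose count gives NA flags; D: not randomized, so all FALSE
toy[, c("pid", "ITT", "mITT", "PP")]
```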
Code
participants |>
  count(randomized, ITT, mITT, PP) |>
  gt()
randomized  ITT    mITT   PP     n
FALSE       FALSE  FALSE  FALSE  436
TRUE        TRUE   FALSE  FALSE  6
TRUE        TRUE   TRUE   FALSE  5
TRUE        TRUE   TRUE   TRUE   85
Code
# From Lara Lewis
# Here are my sample sizes:
# ITT - 96
participants |> count(ITT) # same
# MITT - 90 (excludes those with no follow-up / product use: 100016 100045 100050 100056 200350 200457)
participants |> count(mITT) # same
participants |> filter(ITT, !mITT) |> pull(pid) # same
# SAF - 91 (excludes those who did not receive product or did not have post LBP follow-up: 100016 100045 100056 200350 200457)
# PP - 86 (MITT and additionally excludes those meeting replacement criteria (2) and using <80% doses (2): 100004 (+) 100016 100045 100050 100053 (+) 100056 200253 (+) 200350 200457 200465 (+))
participants |> count(PP) # I have 85 (one less)
participants |> filter(ITT, !PP) |> pull(pid) # I have 068100040 in addition to those listed by L.L.
# I believe she should be excluded from the PP population because she took oral antibiotic between visit 1300 and 1400
participants |>
  filter(randomized) |>
  arrange(ITT, mITT, PP, pid) |>
  select(pid, ITT, mITT, PP, everything()) |>
  View()
# I have
# ITT - 96
# mITT - 90 same
# PP - 85 and I have extra: 068100040,
Code
participants |>
  filter(randomized, mITT_2 != mITT) |>
  View()
# 100050 reported using product in daily diaries but does not have CRF23 data; also, while she did not present at visits 1200:1500, she presented at visit 1700 -> I believe she needs to be included in the mITT, but not sure...
write_csv(
  participants |>
    filter(randomized) |>
    arrange(ITT, mITT, mITT_2, PP, PP_2) |>
    mutate(
      PP_LL = ifelse(pid %in% c("100004", "200465"), FALSE, PP),
      comment = case_when(
        pid == "100004" ~ "She has missing reports in CRF23, but she filled her daily diaries `properly` and they suggest that she took all of her doses. What do you think?",
        pid == "200465" ~ "She is a group4 participant, so, she had 4 remaining tablets at one visit, but none at the following one, and her daily diaries suggest that she took all the doses. What do you think?",
        pid == "100050" ~ "She reported using product in daily diaries but does not have CRF23 data; also, while she did not present at visits 1200:1500, she presented at visit 1700 -> I believe she needs to be included in the mITT, but not sure... what do you think?",
        TRUE ~ NA_character_
      )
    ) |>
    relocate(PP_LL, .after = PP_2),
  file = "population_flags.csv"
)
Dictionary
Code
participants_variable_dictionary <- tibble(variable = colnames(participants)) |>
  mutate(
    description = case_when(
      variable == "pid" ~ "Participant ID",
      variable == "site" ~ "Study site (MGH or CAP)",
      variable == "location" ~ "Location of the participant (US or SA)",
      variable == "randomized" ~ "Randomized participant (TRUE or FALSE)",
      variable == "arm" ~ "Randomization arm",
      variable == "ITT" ~ "Intention-to-treat population flag (TRUE or FALSE)",
      variable == "mITT" ~ "Modified intention-to-treat population flag (TRUE or FALSE)",
      variable == "PP" ~ "Per-protocol population flag (TRUE or FALSE)",
      variable == "meet_eligibility" ~ "Eligibility flag (TRUE or FALSE): whether participant met eligibility criteria at their last screening visit",
      variable == "n_swabs_post_product" ~ "Number of swabs collected from visit 1200 (incl.)",
      variable == "n_swabs_post_product_prior_week7" ~ "Number of swabs collected from visit 1200 to 1500 (incl.)",
      variable == "replaced" ~ "Replacement flag (TRUE or FALSE): whether participant was replaced",
      variable == "replacement_criteria_bleeding" ~ "Whether participant met replacement criteria related to bleeding.",
      variable == "any_antibiotics" ~ "Whether participant used any antibiotics before visit 1500.",
      variable == "any_HSIL" ~ "Whether participant had HSIL pap at screening visit.",
      variable == "meets_replacement_criteria" ~ "Whether participant meets any of the replacement criteria (bleeding, antibiotics, HSIL pap)",
      variable == "n_study_product_doses_crf23" ~ "Number of doses taken (as reported in CRF23; missing values imputed to 0)",
      variable == "n_study_product_doses_exposures" ~ "Number of doses taken (as reported in exposures table)",
      TRUE ~ "???"
    )
  )
Code
participants_variable_dictionary |> gt()
variable                          description
pid                               Participant ID
site                              Study site (MGH or CAP)
location                          Location of the participant (US or SA)
randomized                        Randomized participant (TRUE or FALSE)
arm                               Randomization arm
ITT                               Intention-to-treat population flag (TRUE or FALSE)
mITT                              Modified intention-to-treat population flag (TRUE or FALSE)
PP                                Per-protocol population flag (TRUE or FALSE)
meet_eligibility                  Eligibility flag (TRUE or FALSE): whether participant met eligibility criteria at their last screening visit
n_swabs_post_product              Number of swabs collected from visit 1200 (incl.)
n_swabs_post_product_prior_week7  Number of swabs collected from visit 1200 to 1500 (incl.)
replaced                          Replacement flag (TRUE or FALSE): whether participant was replaced
replacement_criteria_bleeding     Whether participant met replacement criteria related to bleeding.
any_antibiotics                   Whether participant used any antibiotics before visit 1500.
any_HSIL                          Whether participant had HSIL pap at screening visit.
meets_replacement_criteria        Whether participant meets any of the replacement criteria (bleeding, antibiotics, HSIL pap)
n_study_product_doses_crf23       Number of doses taken (as reported in CRF23; missing values imputed to 0)
n_study_product_doses_exposures   Number of doses taken (as reported in exposures table)
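Because the dictionary falls back to "???" for any column without a description, a quick check can catch columns added later and left undocumented. The sketch below is only illustrative: undocumented_variables is a hypothetical helper and the toy dictionary (including new_column) is invented.

```r
# Return the variables whose description is still the "???" placeholder
undocumented_variables <- function(dictionary, placeholder = "???") {
  dictionary$variable[dictionary$description == placeholder]
}

# Toy dictionary; `new_column` is a made-up, deliberately undocumented variable
toy_dictionary <- data.frame(
  variable = c("pid", "site", "new_column"),
  description = c("Participant ID", "Study site (MGH or CAP)", "???"),
  stringsAsFactors = FALSE
)

undocumented_variables(toy_dictionary)
```

The same call could be applied to participants_variable_dictionary or visits_variable_dictionary, for instance wrapped in stopifnot(length(...) == 0) before saving the tables.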
Visit tables (visits_long and visits)
visits_long
We build the visits_long table by concatenating the pid, dfseq, and study_day from all CRFs. It consequently has the following columns
# mismatched_study_days |>
#   ggplot() + aes(x = f_diff, fill = n |> factor()) + geom_histogram() +
#   facet_grid(n ~ .) +
#   xlab("Fraction of study days different from the most frequent study day")
#
# mismatched_study_days |>
#   filter(diff_with_most_freq_study_day != 0) |>
#   ggplot(aes(x = diff_with_most_freq_study_day)) +
#   geom_histogram(binwidth = 1)
#
# mismatched_study_days |>
#   filter(diff_with_most_freq_study_day != 0) |>
#   filter(abs(diff_with_most_freq_study_day) < 100) |>
#   ggplot(aes(x = diff_with_most_freq_study_day)) +
#   geom_histogram(binwidth = 1)
All mismatches are for screening visits and for just one participant (068200281); it looks like she had some of her 2120 CRFs done at one visit and came back for the others.
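As a rough illustration of how such a long table can be assembled and checked, the sketch below stacks a named list of toy CRF tables and lists the visits whose study_day differs between CRFs. The toy_crfs list and its values are invented, and the actual construction of visits_long may differ.

```r
library(dplyr)

# Toy stand-in for `crf_clean`: a named list of CRF tables (values invented)
toy_crfs <- list(
  crf20 = data.frame(pid = c("A", "A"), dfseq = c(1000, 1200), study_day = c(0, 14)),
  crf35 = data.frame(pid = c("A", "A"), dfseq = c(1000, 1200), study_day = c(0, 15))
)

# Stack the CRFs, keeping the name of the source CRF in a `crf` column
toy_visits_long <- bind_rows(toy_crfs, .id = "crf") |>
  select(crf, pid, dfseq, study_day)

# Visits for which the CRFs disagree on the study day
toy_mismatches <- toy_visits_long |>
  group_by(pid, dfseq) |>
  summarise(n_distinct_study_days = n_distinct(study_day), .groups = "drop") |>
  filter(n_distinct_study_days > 1)

toy_mismatches
```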
visits
From the visits_long table, we build the visits table with one row per pid and visit_code and the following columns:
study_day harmonized (= most frequent) (fixed) study_day across CRFs
visit_type: whether it is a clinic or home visit
visit_planned: whether the visit was planned or additional
visit_attended: whether the visit was attended or not. For home visits, this notes whether we have any CRF data for that visit
specimen_collection_swab (collected, not collected, unclear) [CRF35]
specimen_collection_softcup (collected, not collected, unclear) [CRF35]
specimen_collection_cytobrush (collected, not collected, unclear) [CRF35]
We do not document the home swab collection because it looks like there are more mistakes in the CRF data (CRF47) than in the daily swab manifest.
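The harmonization of study_day (taking the most frequent value across CRFs) can be sketched as follows on invented data. This is not necessarily the code used to build visits; in particular, how ties between equally frequent study days are broken is left arbitrary here, and n_crfs is a hypothetical helper column.

```r
library(dplyr)

# Toy long table: three CRFs at visit 1200 (one disagreeing), one at visit 1300
toy_visits_long <- data.frame(
  pid = "A",
  visit_code = c("1200", "1200", "1200", "1300"),
  study_day = c(14, 14, 15, 21)
)

toy_visits <- toy_visits_long |>
  # number of CRFs reporting each study_day for a given pid and visit_code
  count(pid, visit_code, study_day, name = "n_crfs") |>
  group_by(pid, visit_code) |>
  mutate(n_distinct_study_days = n()) |>
  # keep the most frequent study_day (a single row per visit, even with ties)
  slice_max(n_crfs, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(pid, visit_code, study_day, n_distinct_study_days)

toy_visits
```

Keeping n_distinct_study_days alongside the harmonized value makes it easy to review the visits where CRFs disagreed.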
visits_variable_dictionary <- tibble(variable = colnames(visits)) |>
  mutate(
    description = case_when(
      variable == "pid" ~ "Participant ID",
      variable == "visit_code" ~ "4-digit visit code (character)",
      variable == "study_day" ~ "Relative study day with respect to the enrollment visit (day 0)",
      variable == "n_distinct_study_days" ~ "Number of distinct study day for that visit code (most often a number larger than 1 denotes a re-screening visit)",
      variable == "crf_plates" ~ "CRF plates that have been filled at that visit for that participant.",
      variable == "visit_attended" ~ "Whether the visit was attended by the participant. For home visit, it is whether a CRF was filled for that visit code.",
      variable == "visit_planned" ~ "Whether the visit was planned or not.",
      variable == "visit_type" ~ "Type of the visit: clinic or home (or unclear for some 'unexpected' visit codes)",
      variable == "specimen_collection_swab" ~ "Number of self-collected swabs at that visit (CRF35)",
      variable == "specimen_collection_softcup" ~ "Whether softcup was collected at that visit (CRF35)",
      variable == "specimen_collection_cytobrush" ~ "Whether endocervical cytobrush was collected at that visit (CRF35)",
      TRUE ~ "???"
    )
  )
Code
visits_variable_dictionary |> gt()
variable                       description
pid                            Participant ID
visit_code                     4-digit visit code (character)
study_day                      Relative study day with respect to the enrollment visit (day 0)
n_distinct_study_days          Number of distinct study day for that visit code (most often a number larger than 1 denotes a re-screening visit)
crf_plates                     CRF plates that have been filled at that visit for that participant.
visit_attended                 Whether the visit was attended by the participant. For home visit, it is whether a CRF was filled for that visit code.
visit_planned                  Whether the visit was planned or not.
visit_type                     Type of the visit: clinic or home (or unclear for some 'unexpected' visit codes)
specimen_collection_swab       Number of self-collected swabs at that visit (CRF35)
specimen_collection_softcup    Whether softcup was collected at that visit (CRF35)
specimen_collection_cytobrush  Whether endocervical cytobrush was collected at that visit (CRF35)